17 research outputs found
DORT: Modeling Dynamic Objects in Recurrent for Multi-Camera 3D Object Detection and Tracking
Recent multi-camera 3D object detectors usually leverage temporal information
to construct multi-view stereo, which alleviates ill-posed depth estimation.
However, they typically assume all the objects are static and directly
aggregate features across frames. This work begins with a theoretical and
empirical analysis to reveal that ignoring the motion of moving objects can
result in serious localization bias. Therefore, we propose to model Dynamic
Objects in RecurrenT (DORT) to tackle this problem. In contrast to previous
global Bird's-Eye-View (BEV) methods, DORT extracts object-wise local volumes for
motion estimation, which also alleviates the heavy computational burden. By
iteratively refining the estimated object motion and location, the preceding
features can be precisely aggregated to the current frame to mitigate the
aforementioned adverse effects. This simple framework has two appealing
properties. First, it is flexible and practical and can be plugged into most
camera-based 3D object detectors. Second, because object motion is predicted in
the loop, it can easily track objects across frames according to their nearest
center distances. Without bells and whistles, DORT outperforms all previous
methods on the nuScenes detection and tracking benchmarks, with 62.5% NDS and
57.6% AMOTA, respectively. The source code will be released.
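The tracking-by-nearest-center-distance step mentioned in the abstract can be illustrated with a minimal sketch. This is not the DORT implementation; the greedy matching strategy, the 2.0 m gate, and all names below are illustrative assumptions.

```python
# Minimal sketch of nearest-center association between existing track centers
# and newly detected 3D object centers. Greedy matching and the 2.0 m gate are
# illustrative choices, not taken from the DORT code.
import numpy as np

def associate_by_center(track_centers, det_centers, max_dist=2.0):
    """Greedily match each detection to the closest unused track center."""
    matches, unmatched_dets, used = [], [], set()
    for d_idx, det in enumerate(det_centers):
        if len(track_centers) == 0:
            unmatched_dets.append(d_idx)
            continue
        dists = np.linalg.norm(track_centers - det, axis=1)
        t_idx = int(np.argmin(dists))
        if dists[t_idx] < max_dist and t_idx not in used:
            matches.append((t_idx, d_idx))
            used.add(t_idx)
        else:
            unmatched_dets.append(d_idx)
    return matches, unmatched_dets

# Example: two existing tracks, three detections in the current frame.
tracks = np.array([[10.0, 2.0, 0.5], [25.0, -3.0, 0.4]])
dets = np.array([[10.3, 2.1, 0.5], [40.0, 0.0, 0.6], [24.8, -3.2, 0.4]])
print(associate_by_center(tracks, dets))  # ([(0, 0), (1, 2)], [1])
```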
PointLLM: Empowering Large Language Models to Understand Point Clouds
The unprecedented advancements in Large Language Models (LLMs) have had a
profound impact on natural language processing but have yet to fully embrace the
realm of 3D understanding. This paper introduces PointLLM, a preliminary effort
to fill this gap, thereby enabling LLMs to understand point clouds and offering
a new avenue beyond 2D visual data. PointLLM processes colored object point
clouds with human instructions and generates contextually appropriate
responses, illustrating its grasp of point clouds and common sense.
Specifically, it leverages a point cloud encoder with a powerful LLM to
effectively fuse geometric, appearance, and linguistic information. We collect
a novel dataset comprising 660K simple and 70K complex point-text instruction
pairs to enable a two-stage training strategy: initially aligning latent spaces
and subsequently instruction-tuning the unified model. To rigorously evaluate
our model's perceptual abilities and its generalization capabilities, we
establish two benchmarks: Generative 3D Object Classification and 3D Object
Captioning, assessed through three different methods, including human
evaluation, GPT-4/ChatGPT evaluation, and traditional metrics. Experimental
results show that PointLLM demonstrates superior performance over existing 2D
baselines. Remarkably, in human-evaluated object captioning tasks, PointLLM
outperforms human annotators in over 50% of the samples. Codes, datasets, and
benchmarks are available at https://github.com/OpenRobotLab/PointLLM .
Comment: 19 pages. Empowering large language models with 3D point cloud
understanding, accompanied by a novel dataset and carefully designed
benchmarks. Project page: https://runsenxu.com/projects/PointLLM
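As a rough illustration of the encoder-to-LLM fusion described above, the sketch below projects pooled point-cloud features into a few "point tokens" and prepends them to the LLM's text embeddings. The module names, dimensions, and the pooling encoder are assumptions for illustration, not PointLLM's actual architecture.

```python
# Hedged sketch: a point cloud encoder produces features that are projected
# into the LLM embedding space and prepended to the text tokens. Sizes and
# modules are illustrative placeholders, not the PointLLM implementation.
import torch
import torch.nn as nn

class PointPrefixFusion(nn.Module):
    def __init__(self, point_feat_dim=384, llm_dim=4096, num_point_tokens=8):
        super().__init__()
        # Stand-in for a pretrained point cloud encoder (PointNet-style MLP).
        self.point_encoder = nn.Sequential(
            nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, point_feat_dim))
        self.num_point_tokens = num_point_tokens
        self.llm_dim = llm_dim
        # Projector aligning point features with the LLM latent space
        # (this alignment is what a first training stage would learn).
        self.projector = nn.Linear(point_feat_dim, llm_dim * num_point_tokens)

    def forward(self, xyz_rgb, text_embeds):
        # xyz_rgb: (B, N, 6) colored points; text_embeds: (B, T, llm_dim)
        point_feats = self.point_encoder(xyz_rgb).max(dim=1).values
        point_tokens = self.projector(point_feats).view(
            -1, self.num_point_tokens, self.llm_dim)
        # The LLM then consumes [point tokens ; text tokens] as one sequence.
        return torch.cat([point_tokens, text_embeds], dim=1)

fusion = PointPrefixFusion()
fused = fusion(torch.randn(2, 1024, 6), torch.randn(2, 16, 4096))
print(fused.shape)  # torch.Size([2, 24, 4096])
```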
QDTrack: Quasi-Dense Similarity Learning for Appearance-Only Multiple Object Tracking
Similarity learning has been recognized as a crucial step for object
tracking. However, existing multiple object tracking methods only use sparse
ground truth matching as the training objective, while ignoring the majority of
the informative regions in images. In this paper, we present Quasi-Dense
Similarity Learning, which densely samples hundreds of object regions on a pair
of images for contrastive learning. We combine this similarity learning with
multiple existing object detectors to build Quasi-Dense Tracking (QDTrack),
which does not require displacement regression or motion priors. We find that
the resulting distinctive feature space admits a simple nearest neighbor search
at inference time for object association. In addition, we show that our
similarity learning scheme is not limited to video data, but can learn
effective instance similarity even from static images, enabling competitive
tracking performance without training on videos or using tracking supervision.
We conduct extensive experiments on a wide variety of popular MOT benchmarks.
We find that, despite its simplicity, QDTrack rivals the performance of
state-of-the-art tracking methods on all benchmarks and sets a new
state-of-the-art on the large-scale BDD100K MOT benchmark, while introducing
negligible computational overhead to the detector.
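The quasi-dense idea of contrasting many sampled region embeddings across an image pair can be sketched with a generic multi-positive contrastive loss. This is a plain InfoNCE-style stand-in, not QDTrack's exact objective; the names and temperature value are assumptions.

```python
# Hedged sketch of contrastive similarity learning over sampled region
# embeddings from a pair of frames: same-identity regions are pulled together,
# all others pushed apart. Not the exact QDTrack loss formulation.
import torch
import torch.nn.functional as F

def region_contrastive_loss(embeds_a, embeds_b, ids_a, ids_b, temperature=0.07):
    """embeds_*: (N, D) region embeddings; ids_*: (N,) object identities."""
    embeds_a = F.normalize(embeds_a, dim=1)
    embeds_b = F.normalize(embeds_b, dim=1)
    sim = embeds_a @ embeds_b.t() / temperature             # (Na, Nb)
    pos_mask = ids_a.unsqueeze(1) == ids_b.unsqueeze(0)     # (Na, Nb), bool
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_f = pos_mask.float()
    # Average log-probability of positives, for anchors that have at least one.
    pos_log_prob = (log_prob * pos_f).sum(1) / pos_f.sum(1).clamp(min=1)
    has_pos = pos_mask.any(dim=1)
    return -(pos_log_prob[has_pos]).mean()

# Toy example: 4 sampled regions per frame with identities 0..2.
loss = region_contrastive_loss(
    torch.randn(4, 256), torch.randn(4, 256),
    torch.tensor([0, 1, 2, 0]), torch.tensor([1, 0, 2, 2]))
print(loss.item())
```

At inference, association can then be a nearest-neighbor lookup in this embedding space, as the abstract notes.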
Unified Human-Scene Interaction via Prompted Chain-of-Contacts
Human-Scene Interaction (HSI) is a vital component of fields like embodied AI
and virtual reality. Despite advancements in motion quality and physical
plausibility, two pivotal factors, versatile interaction control and a
user-friendly interface, require further exploration before HSI can be applied
in practice. This paper presents a unified HSI framework,
UniHSI, which supports unified control of diverse interactions through language
commands. This framework is built upon the definition of interaction as Chain
of Contacts (CoC): steps of human joint-object part pairs, which is inspired by
the strong correlation between interaction types and human-object contact
regions. Based on this definition, UniHSI comprises a Large Language Model
(LLM) Planner that translates language prompts into task plans in the form of
CoC, and a Unified Controller that turns CoC into uniform task execution. To
facilitate training and evaluation, we collect a new dataset named ScenePlan
that encompasses thousands of task plans generated by LLMs based on diverse
scenarios. Comprehensive experiments demonstrate the effectiveness of our
framework in versatile task execution and generalizability to real scanned
scenes. The project page is at https://github.com/OpenRobotLab/UniHSI .
Comment: A unified Human-Scene Interaction framework that supports versatile
interactions through language commands. Project URL:
https://xizaoqu.github.io/unihsi/ . Code:
https://github.com/OpenRobotLab/UniHSI
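A Chain of Contacts, as defined above, is an ordered list of steps, each pairing human joints with object parts. The sketch below shows one way such a plan could be represented; the field names and the example plan are illustrative, not the ScenePlan schema.

```python
# Hedged sketch of a Chain-of-Contacts (CoC) representation: an interaction is
# a sequence of steps, each a set of (human joint, object part, relation)
# pairs. Field names and the example plan are illustrative assumptions.
from dataclasses import dataclass
from typing import List

@dataclass
class ContactPair:
    joint: str        # e.g., "pelvis", "right_hand"
    object_part: str  # e.g., "chair seat", "chair back"
    relation: str     # e.g., "contact" or "not contact"

@dataclass
class CoCStep:
    pairs: List[ContactPair]

# A plan an LLM planner might emit for the prompt "sit on the chair":
sit_on_chair = [
    CoCStep([ContactPair("right_hand", "chair back", "contact")]),
    CoCStep([ContactPair("pelvis", "chair seat", "contact"),
             ContactPair("left_foot", "floor", "contact"),
             ContactPair("right_foot", "floor", "contact")]),
]

for i, step in enumerate(sit_on_chair, 1):
    print(f"step {i}:", [(p.joint, p.object_part, p.relation) for p in step.pairs])
```

A unified controller would then execute these steps one by one, which is what makes the same interface reusable across interaction types.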
Context-Aware Mixup for Domain Adaptive Semantic Segmentation
Unsupervised domain adaptation (UDA) aims to adapt a model of the labeled
source domain to an unlabeled target domain. Existing UDA-based semantic
segmentation approaches reduce domain shift at the pixel, feature, and output
levels. However, almost all of them largely neglect contextual dependency,
which is generally shared across domains, leading to suboptimal performance.
In this paper, we propose a novel
Context-Aware Mixup (CAMix) framework for domain adaptive semantic
segmentation, which exploits contextual dependency as explicit prior knowledge
in a fully end-to-end trainable manner to enhance adaptability to the target
domain. First, we present a contextual
mask generation strategy by leveraging the accumulated spatial distributions
and prior contextual relationships. The generated contextual mask is critical
in this work and will guide the context-aware domain mixup on three different
levels. In addition, given the contextual knowledge, we introduce a
significance-reweighted consistency loss that penalizes the inconsistency
between the mixed student prediction and the mixed teacher prediction,
alleviating negative transfer during adaptation, e.g., early performance
degradation.
Extensive experiments and analysis demonstrate the effectiveness of our method
against the state-of-the-art approaches on widely-used UDA benchmarks.
Comment: Accepted to IEEE Transactions on Circuits and Systems for Video
Technology (TCSVT).
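The core mixup operation described above, mixing source and target images and labels under a spatial mask, can be sketched as follows. The random mask here is only a placeholder for CAMix's contextual mask generation, and all names are illustrative.

```python
# Hedged sketch of mask-guided domain mixup: a binary mask decides which
# regions come from the source image/label and which from the target. The
# random mask below stands in for CAMix's contextual mask generation.
import torch

def masked_mixup(src_img, tgt_img, src_lbl, tgt_lbl, mask):
    """mask: (H, W) with 1 = take source pixel/label, 0 = take target."""
    m_img = mask.unsqueeze(0).float()             # broadcast over channels
    mixed_img = m_img * src_img + (1 - m_img) * tgt_img
    mixed_lbl = torch.where(mask.bool(), src_lbl, tgt_lbl)
    return mixed_img, mixed_lbl

H, W = 64, 64
src_img, tgt_img = torch.rand(3, H, W), torch.rand(3, H, W)
src_lbl, tgt_lbl = torch.randint(0, 19, (H, W)), torch.randint(0, 19, (H, W))
mask = (torch.rand(H, W) > 0.5).long()            # placeholder contextual mask
mixed_img, mixed_lbl = masked_mixup(src_img, tgt_img, src_lbl, tgt_lbl, mask)
print(mixed_img.shape, mixed_lbl.shape)           # (3, 64, 64) (64, 64)
```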